
    Optimizing computation-communication overlap in asynchronous task-based programs

    Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high-performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively. Peer Reviewed. Postprint (author's final draft).

    Implicit transactional memory in kilo-instruction multiprocessors

    Although they have been the main server technology for many years, multiprocessors are undergoing a renaissance due to multi-core chips and the attractive scalability properties of combining a number of such multi-core chips into a system. The widespread use of multiprocessor systems will make performance losses due to consistency models and synchronization styles of popular programming models even more evident than they already are. Known architectural approaches to combat these losses are generally too complex, too specialized, or not transparent to software. In this article, we introduce implicit transactional memory as a generalized architectural concept to remove unnecessary performance losses caused by consistency models and synchronization styles. We show how the concept of implicit transactions can be implemented with low complexity by leveraging the multi-checkpoint mechanism of the Kilo-Instruction Processor. By relying on a general speculation substrate, this method supports even the strictest consistency model – sequential consistency – potentially as effectively as weaker models, and it allows multiple threads to speculatively execute critical sections, beyond barriers and event synchronizations. Postprint (published version).

    BST: A BookSim-based toolset to simulate NoCs with single- and multi-hop bypass

    Networks-on-Chip (NoCs) are a critical part of modern multiprocessors and their relevance will grow with the number of cores. The development of future NoC designs relies on detailed simulation models that accurately estimate their performance, power and hardware cost. Bypass routers are promising proposals due to their improved performance. Bypass routers reduce latency thanks to a combination of speculation, pre-routing (lookahead routing) and buffer bypass, which also reduces energy consumption by avoiding unnecessary buffer writes and reads. Multi-hop bypass NoCs, known as SMART, even bypass the crossbar of multiple routers in a single cycle. However, publicly available NoC simulators, such as BookSim or Garnet, do not implement bypass mechanisms or do not model them accurately. In this work, we present Bypass Simulation Toolset (BST), a set of tools to accurately simulate NoCs with single- and multi-hop bypass routers. BST combines and extends several simulation tools: an extension of BookSim with state-of-the-art cycle-accurate bypass router models and additional flow control mechanisms; an RTL implementation of multi-hop bypass mechanisms based on OpenSMART; an API to ease a modular integration of the BST NoC simulator in full system simulators; and a set of scripts to automate simulation execution and data collection.
To showcase BST, we i) validate BookSim SMART models with the RTL implementation; ii) compare bypass and traditional non-bypass router models; iii) integrate BookSim in gem5 using the proposed API and compare it with gem5’s Simple and Garnet 2.0 NoC models; and iv) present a case study evaluating different combinations of router types and topologies recently proposed for NoCs, highlighting the flexibility of the BST toolset. This work was supported by the Spanish Ministry of Science, Innovation and Universities, contracts TIN2015-65316-P, TIN2016-76635-C2-2-R (AEI/FEDER, UE) and PID2019-105660RB-C22; the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877); and the HiPEAC Network of Excellence. I. Perez is partially supported by an FPI grant, BES-2017-079971. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Bluespec Inc. provided access to Bluespec tools. Peer Reviewed. Postprint (author's final draft).
